Motivation

New to New York City, our group members are all shocked by the high cost of building rentals. This encourages us to explore more about the property market and the buildings’ price in NYC.

We are curious about the distribution of buildings of each pricing level in time and geographical zones. Also, we want to find factors that may have influence on buildings value. Based on which, we can further modeling and predicting the buildings price in New York.

Data Source

To attain the price information of buildings, we use the dataset published by the Department of Finance of NYC. The origin data can be acquired from NYC Open Data

This dataset contains the income and expense statements of condominium and comparable rental buildings in NYC that helps people estimate the marked value of their own buildings. In each observations, a condominium/cooperative is compared with three rental buildings that have similar physical and geographical features. We equaly treat these two kind of buildings and transformed the data shape to ensure that each observation records exactly one building.

Preprocessing

EDA

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(plotly)
## 
## 载入程辑包:'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(patchwork)
library(kableExtra)
## 
## 载入程辑包:'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(corrplot)
## corrplot 0.92 loaded
library(glmnet)
## 载入需要的程辑包:Matrix
## 
## 载入程辑包:'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-4
library(rgl)

knitr::opts_chunk$set(
  fig.height = 6,
  fig.width = 8
)
theme_set(theme_minimal() + theme(legend.position = "bottom"))
options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis",
  digits = 3
)
scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d

load("data/cleaned_data.RData")

df = 
  transformed_rental_income %>% 
  separate(boro_block_lot, into = c("borough", "block", "lot"), sep = "-") %>% 
  mutate(
    borough = case_when(
      borough == "1" ~ "Manhattan",
      borough == "2" ~ "Bronx",
      borough == "3" ~ "Brooklyn",
      borough == "4" ~ "Queens",
      borough == "5" ~ "Staten Island")
    )

df %>% 
  mutate(borough = fct_reorder(borough, full_market_value)) %>% 
  group_by(borough) %>%
  summarise(
    observations = n()
  ) %>%
  plot_ly(y = ~borough, 
          x = ~observations, 
          color = ~borough, 
          type = "bar", 
          colors = "viridis",
          orientation = 'h') %>%
  layout(title = 'Records in each Borough')
df %>% 
  mutate(borough = fct_reorder(borough, full_market_value)) %>% 
  plot_ly(y = ~log(full_market_value), color = ~borough, type = "box", colors = "viridis") %>%
  layout(
    xaxis = list(title = 'Borough'),
    yaxis = list(title = 'log(Market value)')
  )

According to the histogram plot and box plot, we can see that the order of mean value is Manhattan, Queens, Brooklyn, Bronx and Staten Island from the highest to the lowest. Both number and mean value of rentals in Manhattan take the head in five boroughs.

Take a overview on the trends of rentals over time. (stratified by boroughs)

ddf = 
  df %>% 
  mutate(classification = substr(building_classification, 4, 12))

line_df = ddf %>% 
  group_by(borough, year_built) %>% 
  summarise(mean_value_year = mean(full_market_value)) %>% 
  ggplot(aes(x = year_built, y = mean_value_year, color = borough))+
  geom_line()+ 
  theme_bw()+
  facet_wrap(~borough, ncol=3)+ 
  scale_colour_manual(values=c("#D9B489", "#D49F3A", "#BC3D1C", "#4C372D", "#E8E3B9"))
## `summarise()` has grouped output by 'borough'. You can override using the
## `.groups` argument.
fig <- ggplotly(line_df)

fig

correlation plot between variables

dddf = 
  ddf %>% 
  select(-report_year, -boro_block, -lot, -building_classification, -address, -zoning)
# Plot the correlation
corr = data.frame(lapply(lapply(dddf, as.factor), as.numeric))
corrplot(cor(corr), type = "lower", method = 'color', tl.col = "black",tl.cex=0.8, tl.srt=50)

Skewness value: it can be used to measure the asymmetry of probability distribution of random variables.

Kurtosis value: it can be used to measure the steepness of the probability distribution of random variables.

View the rent distribution curve and draw a curve distribution diagram and rent scatter diagram、

First, observe the rent data. It can be seen that this data is in line with the normal distribution, but the skewness value is too large. We found a big tail. The “rent” value distribution has obvious skewness, so we will correct it later.

  1. View the data histogram of each property of the apartment

Next, let’s take a look at the impact of some important attributes on the results The histogram is used to show the data distribution. Generally speaking, it refers to which piece of data accounts for a high proportion or number of occurrences, and which piece has a low probability of occurrence. The following figure shows the occurrence of 18 attributes

Continuous variables include: community name (Cname), rent_quantity, total floors, position, subway_station, distance, rent, and the rest are discrete variables.

  1. For discrete variables, we use boxplot to represent (box plot) Discrete variables include time, floor, space, state, bedroom_num, hall_num, toilet _num Rent_style, area, subway_line, decoration.

Box diagram of bedroom number and rent

Box diagram of living room number and rent

Box diagram of area and subway_line

  1. Correlation analysis

Analyze the correlation degree between different factors affecting housing price and housing price, and analyze the correlation degree through the correlation graph. Qualitative and visual analysis of the correlation between different factors affecting housing price and housing price with bar chart

Spacial & Statistical Analysis

Infomation on map

library(tidyverse)
library(plotly)
library(patchwork)

load("data/cleaned_data.RData")
load("data/Manhattan_latlonmap.RData")


transformed_rental_income=
  transformed_rental_income %>%
  separate(boro_block_lot, into = c("borough", "block", "lot"), sep = "-") %>%
  mutate(
  borough = case_when(
    borough == "1" ~ "Manhattan",
    borough == "2" ~ "Bronx",
    borough == "3" ~ "Brooklyn",
    borough == "4" ~ "Queens",
    borough == "5" ~ "Staten Island")
  ) %>% 
  filter(
    borough == "Manhattan"
  )

Manhattan_data = Manhattan_data %>%
  separate(address, into = c("address", "toremove"), sep = ",") %>%
  select(-toremove)

transformed_rental_income = 
  left_join(transformed_rental_income, Manhattan_data, by = "address")

## Columbia university lat lon 40.80754 -73.96257
## Hammer health science building lat lon 40.84252 -73.94243 

for_map =  transformed_rental_income %>% 
  filter(report_year > 2019) %>%
  select(-c(latitude,longitude)) 

income_sqft_M = for_map %>%
  plot_mapbox(
    lon = ~lon,
    lat = ~lat,
    color = ~gross_income_per_sq_ft,
    mode = 'markers',
    alpha = 0.5,
    hovertext = ~gross_income_per_sq_ft,
    hoverinfo = 'text',
    colors = "Spectral"
  ) %>%
  layout(
    mapbox = list(
      style = "streets",
      zoom = 10,
      center = list(lat = 40.81, lon = -73.96)
    )
  )

yearsbuild_M = for_map %>%
  plot_mapbox(
    lon = ~lon,
    lat = ~lat,
    color = ~year_built,
    mode = 'markers',
    alpha = 0.5,
    hovertext = ~year_built,
    hoverinfo = 'text',
    colors = c('red','yellow','green')
  ) %>%
  layout(
    mapbox = list(
      style = "streets",
      zoom = 10,
      center = list(lat = 40.81, lon = -73.96)
    )
  )

elevator_M = for_map %>%
  plot_mapbox(
    lon = ~lon,
    lat = ~lat,
    color = ~building_classification,
    mode = 'markers',
    alpha = 1,
    hovertext = ~building_classification,
    hoverinfo = 'text'
  ) %>%
  layout(
    mapbox = list(
      style = "streets",
      zoom = 10,
      center = list(lat = 40.81, lon = -73.96)
    )
  )


subplot(income_sqft_M, yearsbuild_M,elevator_M, nrows = 1, margin = 0.01) %>% layout(showlegend = FALSE, title = 'Income/sqft - Year built - Classification')

Visualized information of gross_income_per_sq_ft, year_built and Classification on map. The three plots are in same size and version, providing a comparable overview. We then performed statistical analysis to test their relationships to the price.

One Way ANOVA on Elevator Facilities

for_map %>%
  ggplot(aes(x =building_classification, y = gross_income_per_sq_ft,fill =building_classification))+
  geom_boxplot()+
  labs(title = "Elevator Facilities")+
  theme(
        axis.text.x = element_text(angle = 90)
      )+
  guides(fill=F)

According to the box plot, we have few records on C8-WALK-UP and C9-WALK-UP, so we filtered them out in the anova test.

for_anova = for_map %>%
  filter(
  building_classification!= "C8-WALK-UP",
  building_classification!= "C9-WALK-UP"
  )%>%
  select(c(building_classification,gross_income_per_sq_ft))
  
res = lm(gross_income_per_sq_ft ~ factor(building_classification), data = for_anova)
anova(res)
## Analysis of Variance Table
## 
## Response: gross_income_per_sq_ft
##                                    Df  Sum Sq Mean Sq F value Pr(>F)    
## factor(building_classification)    14  280558   20040     163 <2e-16 ***
## Residuals                       18479 2276904     123                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this one way anova test, our null hypothesis is that the average price of each kinds of building are the same. Given a p-value much smaller than 0.05, we reject the null hypothesis and conclude that there are differences in the average price of each kinds of building. This indicates that the building classification is related to the rental price.

The detailed relationship across each types can be further checked by paired t-test.

SLR on Years built

for_map %>%
  ggplot(aes(x = gross_income_per_sq_ft))+
  geom_histogram() +
  labs(
    title = "Price Distribution"
  )

fit1 = lm(gross_income_per_sq_ft ~ year_built, data = for_map)
summary(fit1)
## 
## Call:
## lm(formula = gross_income_per_sq_ft ~ year_built, data = for_map)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.29  -6.90   2.25   7.86  42.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -22.30149    6.19213    -3.6  0.00032 ***
## year_built    0.03227    0.00321    10.1  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.7 on 18494 degrees of freedom
## Multiple R-squared:  0.00544,    Adjusted R-squared:  0.00539 
## F-statistic:  101 on 1 and 18494 DF,  p-value: <2e-16
par(mfrow=c(1,2))
plot(fit1,which = 1) 
plot(fit1,which = 2)

We applied SLR to test the significance of the years built. Given the null hypothesis that the coefficient of years built term is 0, p-value is small and we can reject the null. However, the estimated coefficient is small and the price seems not normally distributed, which may lead to inaccurate result.

Rentals near Columbia

We plotted the buildings information near Columbia campus to help with students choosing their apartment.

latCU = 40.80754
lonCU = -73.96257
latCUIMC = 40.84252
lonCUIMC = -73.94243

near_columbia = for_map %>%
  filter(
    lat >  latCU - 0.01,
    lat < latCU + 0.01,
    lon > lonCU -0.005,
    lon < lonCU + 0.005
    ) %>%
  plot_mapbox(
    lon = ~lon,
    lat = ~lat,
    color = ~gross_income_per_sq_ft,
    mode = 'markers',
    alpha = 1,
    hovertext = ~address,
    text = ~gross_income_per_sq_ft,
    meta = ~building_classification,
    hovertemplate = "%{hovertext} <br> %{meta} <br>gross_income_per_sq_ft: %{text}",
    hoverinfo = 'text',
    colors = "Spectral"
  )  %>%
  layout(
    mapbox = list(
      style = "streets",
      zoom = 13,
      center = list(lat = latCU, lon = lonCU)
    )
  )


near_CUIMC = for_map %>% 
  filter(
    lat >  latCUIMC - 0.01,
    lat < latCUIMC + 0.01,
    lon > lonCUIMC -0.005,
    lon < lonCUIMC + 0.005    
  )%>%
  plot_mapbox(
    lon = ~lon,
    lat = ~lat,
    color = ~gross_income_per_sq_ft,
    mode = 'markers',
    alpha = 1,
    hovertext = ~address,
    text = ~gross_income_per_sq_ft,
    meta = ~building_classification,
    hovertemplate = "%{hovertext} <br> %{meta} <br>gross_income_per_sq_ft: %{text}",
    hoverinfo = 'text',
    colors = "Spectral"
  ) %>%
  layout(
    mapbox = list(
      style = "streets",
      zoom = 13,
      center = list(lat = latCUIMC, lon = lonCUIMC)
    )
  )

subplot(near_columbia, near_CUIMC, nrows = 1, titleY = TRUE, titleX = TRUE, margin = 0.01) %>% layout(title = 'Rentals near Columbia Campus')
rm (list = ls ())

Modeling

Data

Through previous data exploration and data analysis, we have learned that there are some variables in our data set strongly associated with the estimated gross income.

Therefore, we decided to dig deeper and use some of these variables to predict estimated gross income.

The variables involved are as followed:

 ·estimated_gross_income (dependent): Estimated income from the building

 ·latitude: Latitude of the building

 ·longitude: Longitude of the building

 ·type: Walk-up/Elevators

 ·borough: 1 ~ Manhattan, 2 ~ The Bronx, 3 ~ Brooklyn, 4 ~ Queens, 5 ~ Staten Island

 ·total_units: Total number of units in the building

 ·year_built: The year the building was built

 .gross_sq_ft: Gross square footage of the building

However, in the preliminary analysis, we found that some of them do not meet the normal distribution, so we use logarithm operation and the their distribution are fixed:

load('./data/cleaned_data.RData')

data_1 = transformed_rental_income %>%
  separate(boro_block_lot, into = c("borough", "block", "lot"), sep = "-") %>%
  separate(building_classification, into = c("code", "type", "type_1"), sep = "-") %>%
  mutate(borough = factor(borough),
         type = factor(type)) 
  
# make density plots of every variable to check its normality

p1 = data_1 %>%
  ggplot(aes(x = latitude)) +
  geom_density(aes(y = after_stat(density)),
                 fill = "#F0E442",
                 alpha = 0.75,
                 bins = 10)

p2 = data_1 %>%
  ggplot(aes(x = longitude)) +
  geom_density(aes(y = after_stat(density)),
                 fill = "#0072B2",
                 alpha = 0.75,
                 bins = 10)

# Logarithmic operation
p3 = data_1 %>%
  ggplot(aes(x = log(total_units))) +
  geom_density(aes(y = after_stat(density)),
                 fill = "#CC79A7",
                 alpha = 0.75,
                 bins = 10)

p4 = data_1 %>%
  ggplot(aes(x = year_built)) +
  geom_density(aes(y = after_stat(density)),
                 fill = "#D55E00",
                 alpha = 0.75,
                 bins = 10)
# Logarithmic operation
p5 = data_1 %>%
  ggplot(aes(x = log(gross_sq_ft))) +
  geom_density(aes(y = after_stat(density)),
                 fill = "#56B4E9",
                 alpha = 0.75,
                 bins = 10)
# Logarithmic operation
p6 = data_1 %>%
  ggplot(aes(x = log(estimated_gross_income))) +
  geom_density(aes(y = after_stat(density)),
                 fill = "#009E73",
                 alpha = 0.75,
                 bins = 10)


p1 + p2 + p3 + p4 + p5 + p6 

Fitting Model

We decide to use latitude and longtitude with lowess (locally weighted regression scatter plot smoothing) for smoothing model, so other 5 variables are fitted initially.

#### log_estimated_gross_income

data_egi = data_1 %>%
  mutate(
    log_total_units = log(total_units),
    log_gross_sq_ft = log(gross_sq_ft),
    log_estimated_gross_income = log(estimated_gross_income)
  ) %>%
  select(log_estimated_gross_income, type, borough,
         log_total_units, year_built ,log_gross_sq_ft)

To find the best model, we used stepwise regression procedure. That is to say, we start with no predictors, then sequentially add the most contributive predictors (like forward selection). After adding each new variable, remove any variables that no longer provide an improvement in the model fit (like backward selection)

Model without predictors

# intercept only model
io_eg = lm(log_estimated_gross_income ~ 1, data = data_egi)

\[ \widehat{log\_estimated\_gross\_income} = 13.9425 \]

Model with all of the predictors

# model with all predictors
all_eg = lm(log_estimated_gross_income ~ ., data = data_egi)
# formula
equatiomatic::extract_eq(all_eg, use_coefs = TRUE, coef_digits = 4)

\[ \operatorname{\widehat{log\_estimated\_gross\_income}} = 1.9014 - 0.0484(\operatorname{type}_{\operatorname{WALK}}) - 0.9161(\operatorname{borough}_{\operatorname{2}}) - 0.6698(\operatorname{borough}_{\operatorname{3}}) - 0.6618(\operatorname{borough}_{\operatorname{4}}) - 0.797(\operatorname{borough}_{\operatorname{5}}) - 0.0011(\operatorname{\log\_total\_units}) + 0.001(\operatorname{year\_built}) + 0.9741(\operatorname{\log\_gross\_sq\_ft}) \]

The best model

# stepwise procedure
egi_fit = step(io_eg, direction = "both", scope = formula(all_eg), trace = 0)

# formula
equatiomatic::extract_eq(egi_fit, use_coefs = TRUE, coef_digits = 4)

\[ \operatorname{\widehat{log\_estimated\_gross\_income}} = 1.9139 + 0.9732(\operatorname{\log\_gross\_sq\_ft}) - 0.916(\operatorname{borough}_{\operatorname{2}}) - 0.6697(\operatorname{borough}_{\operatorname{3}}) - 0.6619(\operatorname{borough}_{\operatorname{4}}) - 0.7969(\operatorname{borough}_{\operatorname{5}}) + 0.001(\operatorname{year\_built}) - 0.0485(\operatorname{type}_{\operatorname{WALK}}) \]

Coefficients for best model

egi_fit %>% broom::tidy() %>% knitr::kable(digits = 4)
term estimate std.error statistic p.value
(Intercept) 1.9139 0.0723 26.5 0
log_gross_sq_ft 0.9732 0.0011 908.3 0
borough2 -0.9160 0.0028 -328.5 0
borough3 -0.6697 0.0021 -316.8 0
borough4 -0.6619 0.0022 -307.7 0
borough5 -0.7969 0.0109 -73.3 0
year_built 0.0010 0.0000 25.3 0
typeWALK -0.0485 0.0022 -22.1 0

Model checking

# check the model
performance::check_model(egi_fit, check = c("outliers", "qq", "normality"))

From above we know that log_total_units is left out in the fitting procedure, and the checking plots indicate that our model is well done for the sake of normality and outliers.

Predicting Model using LOWESS method on location

From this part, the model didn’t use the location (longitude and latitude) of the building. Obviouly, there is no linear relationship between location and gross income. Thus, we use lowess method to approximate the influence of location on rental price.

Here, we are interested in the predicting the “gross income per squre feet” variable. The modeling process is as follows:

  • Fittting a linear model without longitude and latitude
  • Take the residual of the model into 2-dimensional LOWESS smoothing with longitude and latitude.
  • For a new datapoint, – pipe into the linear model to get the linear prediction – Use the smoothed model to estimate the location residual – Add up the two terms to get the final prediction.

Building the partial-LOESS model

First, we take the dataset and fit a linear model without longitude and latitude, and save the residual for the following LOESS method.

rm(list=ls())
load("data/cleaned_data.RData")

transformed_rental_income = 
  transformed_rental_income %>%
  mutate(
    is_elevator = str_detect(building_classification,"ELEVATOR")
  )

linear_model = lm(gross_income_per_sq_ft~ is_elevator+total_units+year_built+report_year+gross_sq_ft, data=transformed_rental_income)

transformed_rental_income = 
  transformed_rental_income %>%
  mutate(
    resid_linear = linear_model$residuals
  )  %>%
  group_by(address) %>%
  summarize(
    gross_income_per_sq_ft = mean(gross_income_per_sq_ft),
    resid_linear = mean(resid_linear),
    longitude = mean(longitude),
    latitude = mean(latitude),
    is_elevator = as.logical(mean(is_elevator)),
    total_units = mean(total_units),
    year_built = mean(year_built),
    report_year = mean(report_year),
    gross_sq_ft = mean(gross_sq_ft)
  )

Then, we applied LOESS smoothing on the residual by location. Here, we chose span = 0.05 because gross_income may change quickly just by a few blocks, and our dataset is large enough to capture such sudden changes. The smoothed residual is as follows:

lowess_surf = loess(resid_linear~latitude+longitude,data=transformed_rental_income,span = 0.05)

transformed_rental_income=
  transformed_rental_income %>%
  mutate(
  resid_smooth = lowess_surf$fitted
)  

smooth_location_resid = transformed_rental_income %>%
  select(longitude,latitude,resid_smooth)

transformed_rental_income %>%
  ggplot(aes(x=longitude,y=latitude,color=resid_smooth)) +
  geom_point()

Predicting function and Model Validation

The residual of the new point is estimated by a local linear model with its n_neighbours = 20 neighbours. we don’t need a large number of neighbors because the curve is already smoothed and niose is removed. Based on that, we can combine a local estimator and the linear model together using the following function:

n_neighbours = 20
predict_rental = function(new_data){
  if(nrow(new_data)!=1){
    return("nrow must be 1")
  }
  location_resid =
    smooth_location_resid %>%
    mutate(
      d=(longitude-new_data$longitude)^2+(latitude-new_data$latitude)^2
    ) %>%
    arrange(d) %>% 
    head(n_neighbours) %>%
    lm(resid_smooth~longitude+latitude,data=.) %>%
    predict(newdata = new_data)
  
  predict(linear_model,newdata=new_data) + location_resid
}


# saving the function for Index_Predictor Shiny App
model_coef =
  broom::tidy(linear_model) %>%
  select(term,estimate)

save(predict_rental,linear_model,smooth_location_resid,model_coef,file = "Index_Predictor/modeling_result.RData")

To validate Our model, we sample 100 records in our original dataset, and use the model to predict their gross income pre square feet. However, this is not a cross-validation since excluding these observations has little effect on our model due to large sample number.

n_test=100
error = rep(0,n_test)

for(i in 1:n_test){
  test_sample=
    transformed_rental_income %>%
    slice_sample(n=1)
  predicted = as.numeric(predict_rental(test_sample))
  error[i] = test_sample$gross_income_per_sq_ft[1]-predicted
}
tibble(error=error) %>%
  ggplot(aes(x=error)) +
  geom_density()

Discussion